Skip to content

ARROW-14822: [C++] Implement floor/ceil/round for temporal objects#11818

Closed
rok wants to merge 11 commits into
apache:masterfrom
rok:ARROW-14822
Closed

ARROW-14822: [C++] Implement floor/ceil/round for temporal objects#11818
rok wants to merge 11 commits into
apache:masterfrom
rok:ARROW-14822

Conversation

@rok

@rok rok commented Nov 30, 2021

Copy link
Copy Markdown
Member

This is to resolve ARROW-14822.

@github-actions

Copy link
Copy Markdown

@github-actions

Copy link
Copy Markdown

⚠️ Ticket has not been started in JIRA, please click 'Start Progress'.

Comment thread cpp/src/arrow/compute/api_scalar.h Outdated
Comment thread cpp/src/arrow/compute/api_scalar.cc Outdated
@rok rok force-pushed the ARROW-14822 branch 4 times, most recently from c986aeb to 5b1fe4b Compare December 6, 2021 10:50
@rok rok force-pushed the ARROW-14822 branch 2 times, most recently from 0988d48 to 0814df6 Compare December 14, 2021 23:37
@rok

rok commented Dec 14, 2021

Copy link
Copy Markdown
Member Author

This is getting review ready, couple of issues remaining:

  • RoundTemporal, CeilTemporal and FloorTemporal could maybe be templated from a single struct
  • Nonexistent in RoundTemporalOptions might be unnecessary as we assume timestamps are in UTC so they always exist. Probably holds for Ambiguous too.
  • Test data needs some work, especially pre-1970 and edge-cases.

@rok

rok commented Dec 14, 2021

Copy link
Copy Markdown
Member Author

@lidavidm @jorisvandenbossche could you take a quick glance to see if something needs to be done fundamentally different here? If not I hope to have this review ready tomorrow.

@lidavidm

Copy link
Copy Markdown
Member
  • Nonexistent in RoundTemporalOptions might be unnecessary as we assume timestamps are in UTC so they always exist. Probably holds for Ambiguous too.

Would we want to eventually support rounding in a timezone? (That can be deferred since I can see that getting tricky.) e.g. rounding to the hour may differ in a timezone that is xx:30 offset from UTC.

@rok

rok commented Dec 14, 2021

Copy link
Copy Markdown
Member Author
  • Nonexistent in RoundTemporalOptions might be unnecessary as we assume timestamps are in UTC so they always exist. Probably holds for Ambiguous too.

Would we want to eventually support rounding in a timezone? (That can be deferred since I can see that getting tricky.) e.g. rounding to the hour may differ in a timezone that is xx:30 offset from UTC.

I've been thinking about that exact situation (xx:30) and I'm not sure yet how to tackle it but I'd like to in this ticket.

@lidavidm

Copy link
Copy Markdown
Member

I only took a quick scan but I don't see anything fundamentally wrong.

  • It gets confusing with all the chrono types but we should stick to UpperCamelCase for Arrow functions/methods
  • Do we want QUARTER in addition to SEASON?

@rok

rok commented Dec 15, 2021

Copy link
Copy Markdown
Member Author
  • Do we want QUARTER in addition to SEASON?

I had it originally but it felt redundant as it's just RoundTemporal(arg, RoundTemporalOptions(unit=month, multiple=3)). I'll add it back as it will probably get requested.

@lidavidm

Copy link
Copy Markdown
Member

Ah, that is true - we could just document it.

@rok rok marked this pull request as ready for review December 21, 2021 12:04
@rok

rok commented Dec 21, 2021

Copy link
Copy Markdown
Member Author

@lidavidm - I think this is ready for first round of review.
I've decided to remove Nonexistent and Ambiguous.
I'm working out some issues with rounding multiple seasons but other than that I think it's good to review. I'll also try adding origin to set the "zero time" in one of the future commits.

@rok

rok commented Dec 21, 2021

Copy link
Copy Markdown
Member Author

@pitrou any thoughts on this?

@lidavidm lidavidm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good to me, taking into account the TODOs.

Comment thread cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc Outdated
Comment thread cpp/src/arrow/compute/api_scalar.h Outdated
Comment thread python/pyarrow/_compute.pyx Outdated
Comment thread python/pyarrow/tests/test_compute.py Outdated
Comment thread cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc Outdated
Comment thread cpp/src/arrow/compute/api_scalar.h Outdated
@jorisvandenbossche

Copy link
Copy Markdown
Member

I've decided to remove Nonexistent and Ambiguous.

(didn't yet look at the code) Did you then also remove support rounding timestamps with a timezone? Or how does rounding tz-aware timestamps work?

It's also not really clear to me what origin would do. Can you explain with an example what it would do? (I also don't see it in lubridate or pandas?)

@rok

rok commented Dec 22, 2021

Copy link
Copy Markdown
Member Author

I've decided to remove Nonexistent and Ambiguous.

(didn't yet look at the code) Did you then also remove support rounding timestamps with a timezone? Or how does rounding tz-aware timestamps work?

At the moment some of the code (ns, us, ms, s, m, h, d, w) is UTC based even when timezones are present. UTC always converts to a local time so nonexistent is not an issue while ambiguous could be.
Month, quarter, year are currently rounded in local time and integer multiples of those "units" are less problematic because beginnings of months seem to always exist. Perhaps this needs more study.

It's also not really clear to me what origin would do. Can you explain with an example what it would do? (I also don't see it in lubridate or pandas?)

Origin is meant like the origin in Euclidean space. Let's say you want to floor today's date to 100year:

floor("2021-12-22", unit="year", multiple=100, origin="1970-01-01") == "1970-01-01"
floor("2021-12-22", unit="year", multiple=100, origin="0-01-01") == "2000-01-01"

R's clock has this concept.
It's also a "replacement" for week_starts_monday.
e.g.: 1970-01-01 was a Thursday so floor(t, unit="week", multiple=1, origin="1970-01-05") is equivalent to floor(t, unit="week", multiple=1, week_starts_monday=True)

@rok

rok commented Dec 23, 2021

Copy link
Copy Markdown
Member Author

I think tests are now mostly in order. Remaining work:

  • origin
  • timezone handling (nonexistent, ambiguous)
  • shorthand

Am I forgetting something?

Which of these could be pushed out of this scope? I'm thinking shorthand and perhaps perhaps timezone handling.
@lidavidm @jorisvandenbossche

@rok rok force-pushed the ARROW-14822 branch 2 times, most recently from b241c3c to e1e55d0 Compare December 23, 2021 11:11
@rok rok force-pushed the ARROW-14822 branch 3 times, most recently from 4429c20 to d84408d Compare January 1, 2022 17:43
@rok rok force-pushed the ARROW-14822 branch 2 times, most recently from ed1f711 to b93f075 Compare January 3, 2022 01:25
@rok

rok commented Jan 3, 2022

Copy link
Copy Markdown
Member Author

@lidavidm @jorisvandenbossche this now rounds time since epoch in local time (see here). It works ok outside of DST and discussible within. It's a different approach to what e.g. Pandas does.
I'm going to look into a better approach for this.

Comment thread python/pyarrow/tests/test_compute.py Outdated
Comment thread python/pyarrow/tests/test_compute.py Outdated
timestamps = [
# "1899-04-18 01:57:09.190202880",
# "1899-09-12 07:03:30.080325120",
# "1904-06-21 20:55:36.493869056",

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these still meant to be commented out?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some odd changes in timezone offsets that are causing problems. I'll try changing ceil to match behaviour.

@rok rok Jan 4, 2022

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the kind of error I'm seeing for these three timestamps:

actual:

Timestamp('1899-04-19 00:31:50+0553', tz='Asia/Kolkata')
Timestamp('1899-09-13 00:31:50+0553', tz='Asia/Kolkata')
Timestamp('1904-06-22 23:59:50+0521', tz='Asia/Kolkata')

expected:

Timestamp('1899-04-19 00:00:00+0553', tz='Asia/Kolkata'),
Timestamp('1899-09-13 00:00:00+0553', tz='Asia/Kolkata'),
Timestamp('1904-06-23 00:00:00+0521', tz='Asia/Kolkata')

@rok rok Jan 4, 2022

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calcutta time was one of the two time zones established in British India in 1884. It was established during the International Meridian Conference held at Washington, D.C. in the United States. It was decided that India had two time zones: Calcutta (now Kolkata) would use the 90th meridian east and Bombay (now Mumbai) the 75th meridian east. It was determined as 5 hours, 53 minutes and 20 seconds ahead of Greenwich Mean Time(UTC+5:53:20).

https://en.wikipedia.org/wiki/Calcutta_Time

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh boy. So, if I'm reading this right, it's rounding using the "modern" TZ definition instead of the "contemporary" one? Maybe we should just not test this case then, since this is down to whatever the timezone database for a system contains…? Or am I misunderstanding.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's that but I'm not 100% and I really want to be :))

@rok rok Jan 4, 2022

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I mislabeled that (switched actual and expected, fixed now). It appears my rounding in local time is not done correctly.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm back to thinking Pandas only uses "modern" TZs while date.h uses a historical database.
I've added a test for this in c++ and removed it from Python.

Comment thread cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc Outdated
Comment thread cpp/src/arrow/compute/kernels/scalar_temporal_test.cc Outdated
@rok rok force-pushed the ARROW-14822 branch 3 times, most recently from 4be4c66 to e2b6d9f Compare January 4, 2022 19:35

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: for context, can we note that this test is here to test the a case where date (and hence our kernels) uses historical TZ info while Pandas (and possibly other libraries) do not?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise I fear confusion when someone in the future looks and wonders why Kolkata was an especially important time zone to test :)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, that makes sense.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment added.

It seems reality is more nuanced than Wikipedia states.

@lidavidm

lidavidm commented Jan 4, 2022

Copy link
Copy Markdown
Member

Oh, and we need to add this to the python api.rst as well.

@rok rok force-pushed the ARROW-14822 branch 2 times, most recently from c7cf795 to e8f9a62 Compare January 4, 2022 20:10
@lidavidm lidavidm closed this in edab145 Jan 4, 2022
@rok

rok commented Jan 4, 2022

Copy link
Copy Markdown
Member Author

Thanks @lidavidm & @jorisvandenbossche ! :)

Follow-up Jiras:

@ursabot

ursabot commented Jan 5, 2022

Copy link
Copy Markdown

Benchmark runs are scheduled for baseline = acce03b and contender = edab145. edab145 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Finished ⬇️0.45% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.09% ⬆️0.04%] ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants